move the LLM instance directly to Assistant to make it cleaner to share with tests#71
Conversation
You did this by changing to 5.2 rather than 4.1-mini. I'm confused what you mean here, since the judge model was already separate from the agent model, and if anything you don't want a heavy judge model that will slow down every single test, since IIRC we don't run tests in parallel by default.
@u9g Ah, I got confused because I read the original PR into the Python example, which used the same model in both places, whereas the original Node PR had separate ones. I didn't look at the current state of the code before asking Claude to do it, so I missed that it had already separated them, which is good. I changed it back to 4.1-mini for now (will do so in Node too), but I'll say that I'm not really sure whether it makes sense to eval with a small model; this is something we should probably develop some benchmarks around. On a first-principles basis, I'd prefer to trust a larger model for evals than the one used in the actual conversation. In conversation, the focus needs to be on latency; outside of conversation, I'm willing to spend longer really determining whether the conversation was good. This is maybe more of a concern with real session evals than with in-codebase synthetic evals like we have here.
Yeah, as with anything, it makes sense to benchmark. If we're changing to a bigger judge model, it might be worth considering parallel testing in the templates.
this gets rid of the awkward AGENT_MODEL constant by just making the LLM an inherent property of the Assistant, which seems more intuitive
I also switched the test judge model to base 5.2 instead of the chat version to make it clearer that you can (and should) use a different model for evals than for core chat.
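A minimal sketch of the design change described above: instead of a module-level AGENT_MODEL constant, the Assistant owns its LLM directly, and tests construct a separate (potentially heavier) judge model. The class and field names here are illustrative stand-ins, not the actual livekit-agents API.

```python
from dataclasses import dataclass, field


@dataclass
class LLM:
    """Illustrative stand-in for an LLM client configured with a model name."""
    model: str


@dataclass
class Assistant:
    # The LLM is an inherent property of the Assistant,
    # replacing a shared AGENT_MODEL constant.
    llm: LLM = field(default_factory=lambda: LLM(model="gpt-4.1-mini"))


# In tests, the same Assistant instance is shared, while the judge
# uses a distinct, larger model for evals.
assistant = Assistant()
judge = LLM(model="gpt-5.2")  # eval model, separate from the chat model

print(assistant.llm.model, judge.model)
```

This keeps the chat model fast for latency-sensitive conversation while letting the eval path spend more compute on judging, as discussed in the thread.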
I'd like to move STT and TTS as well, but I found a bug we need to fix first
also see livekit-examples/agent-starter-node#43